Abstract: A problem of much current practical interest
is the replacement of the wiring infrastructure connecting
approximately 200 sensor and actuator nodes in automobiles
by an access point. This is motivated by the considerable
savings in automobile weight, simplification of manufacturability,
and future upgradability.

A key issue is how to schedule the nodes on the shared 
access point so as to provide regular packet delivery. In 
this and other similar applications, the mean of the inter-delivery
times of packets, i.e., throughput, is not sufficient
to guarantee service-regularity. The time-averaged variance 
of the inter-delivery times of packets is also an important 
metric. 

So motivated, we consider a wireless network where an 
Access Point schedules real-time generated packets to nodes 
over a fading wireless channel. We are interested in designing
simple policies which achieve optimal mean-variance
tradeoff in interdelivery times of packets by minimizing the 
sum of time-averaged means and variances over all clients. 
Our goal is to explore the full range of the Pareto frontier 
of all weighted linear combinations of mean and variance 
so that one can fully exploit the design possibilities. 

We transform this problem into a Markov decision process 
and show that the problem of choosing which node’s packet 
to transmit in each slot can be formulated as a bandit 
problem. We establish that this problem is indexable and 
explicitly derive the Whittle indices. The resulting Index 
policy is optimal in certain cases. We also provide upper 
and lower bounds on the cost for any policy. Extensive 
simulations show that Index policies perform better than 
previously proposed policies. 

I. Introduction 

Traditionally, throughput and delay have been used as
performance metrics to judge quality of service (QoS)
[1]-[7]. The steady-state variance of inter-delivery times
of packets is considered as a measure of service regularity
in [8]. Motivated by cyber-physical systems applications
serving sensors, we address the problem of achieving an
optimal “mean-variance trade-off” in the inter-delivery
times of packets of N clients sharing K channels.

We consider an access point with K channels shared
by N clients. The clients desire a high throughput with
high service regularity. We can associate a reward function
$\frac{\theta_i}{\bar{D}_i} - \mathrm{var}(D_i)$ with client i, where $\theta_i$ is the parameter that
client i uses to tune its trade-off between its throughput
$\frac{1}{\bar{D}_i}$ (where $\bar{D}_i$ is the mean inter-delivery time between
packets of client i) and the service regularity $\mathrm{var}(D_i)$, the
variance of the inter-delivery times for client i. By varying
$\theta_i$ one can explore the full range of design freedom
along the Pareto frontier of all mean-variance tradeoffs.
In summary, the net function which captures the trade-off
is,

$$\sum_{i=1}^{N} R_i \left( \frac{\theta_i}{\bar{D}_i} - \mathrm{var}(D_i) \right),$$

where $R_i > 0$ is the weight attached to client i, and $\theta_i$
is a tunable parameter permitting full exploration of the
Pareto frontier.

Our contributions can be summarized as follows. We 
show how one may obtain tractable decoupled solutions 
for the problem of scheduling the clients by addressing it 
as a Restless Multi-Armed Bandit Problem [9]. In particular
we obtain the Whittle indices in a closed form, which
yields a very elegant solution based merely on comparing 
the indices of the clients. We also derive upper bounds 
on the achievable performance of any policy. Simulation 
results show that the performance of the obtained Index 
policy is very close to optimal. 

II. Related Works 

The steady-state variance of the inter-delivery times of 
packets of clients as a measure of service regularity has 
been considered in [8]. References [8] and [10] consider
the scenario where multiple queues are sharing a server 
and deal with the problem of stabilizing the queues while 
ensuring an optimal delay and service regularity. [11],
[12] perform an analysis of the pathwise starvations in
service for the case of a single-hop multi-user wireless 
network. 

A detailed introduction to Restless Multi-Armed Bandit
Problems (RMBP) can be found in [13]. RMBP and its
relaxation were first introduced in [9]. The RMBP model
has been used earlier in works such as [14], which considered
the problem of choosing an appropriate channel for
up and downlink transmissions in multichannel access.
Reference [15] is another notable work which uses the
RMBP model and derives index policies for optimizing
convex holding costs in a multiclass queue.

We also note that optimality of Index policies has been 
established in certain cases as the population of arms goes 
to infinity [16] and extensive simulations have shown
that Index policies have “good” performance even in the
finite population regime [15], [17]. References [18]-[21]
consider minimization of variance as an objective
in Markov Decision Processes.

III. System Model

We consider the situation where time has been discretized
into slots, and the duration of a slot corresponds
to the time taken to attempt a packet transmission. Each
client is assumed to have one packet at the beginning
of each slot. In each slot, a scheduler chooses K out
of the N clients, and attempts to deliver their packets.
Channel unreliability is modeled by supposing that if
client i is served in slot t, then the packet is delivered
with probability $p_i$, independent of the past attempts.
Moreover the service times are independent across clients.
The scheduler has to choose the K clients transmitted in
each slot so as to maximize the reward function,

$$\sum_{i=1}^{N} R_i \left( \frac{\theta_i}{\bar{D}_i} - \mathrm{var}(D_i) \right), \qquad (1)$$

where $\bar{D}_i$ and $\mathrm{var}(D_i)$ are the mean and variance of the
inter-delivery times of packets for client i in the steady
state distribution.

IV. Markov Decision Process Formulation

The system state at time t is given by the vector
$s(t) := (s_1(t), \ldots, s_N(t))$, with $s_i(t)$ denoting the time
slots elapsed between the latest delivery of a packet of
client i, and t. Because time is discretized, the state vector
s(t) is updated only at the beginning of slot t, and remains
unchanged within the slot. The state thus evolves as,

$$s_i(t+1) = \begin{cases} s_i(t) + 1 & \text{if no packet of client } i \text{ is delivered in slot } t, \\ 0 & \text{if a packet of client } i \text{ is delivered in slot } t. \end{cases}$$

The Access Point (AP) takes a decision at the beginning of
the slot t to grant channel access to K clients by choosing
a control $u(t) \in \{0, 1\}^N$, $\sum_{i} u_i(t) = K$, where $u_i(t) = 1$
implies that client i will be granted channel access in slot
t. The decision can be based on the entire past history of
the system up to time t.

The “reward earned” at time t when the system is in
state s is given by $\sum_{i=1}^{N} R_i \left( \theta_i 1(s_i = 0) - s_i \right)$, and thus is
solely a function of the system state s. With this set-up,
the process s(t) becomes a controlled Markov process.

For a positive discount factor $\beta < 1$, the $\beta$-discounted
optimization problem is to design control policy u(t) so
as to maximize the expected infinite horizon discounted
reward,

$$\liminf_{T \to \infty} E \sum_{t=0}^{T} \beta^t \left( \sum_{i=1}^{N} R_i \left( \theta_i 1(s_i(t) = 0) - s_i(t) \right) \right).$$

Similarly the average reward problem is to maximize the
expected infinite horizon time-average reward,

$$\liminf_{T \to \infty} E \, \frac{1}{T} \sum_{t=0}^{T} \left( \sum_{i=1}^{N} R_i \left( \theta_i 1(s_i(t) = 0) - s_i(t) \right) \right).$$

It is easily verified that the above reward function reduces 
to, 




2=1 



(ZMA ± i))) i 


(4) 


and thus differs slightly from the original reward function (1).
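The slotted model and the per-slot reward of this section can be exercised with a short simulation. The following is a minimal sketch, not the authors' simulator: the function names, the tie-breaking rule, the initial state s = 0 and the example policy `largest_state_first` are assumptions made for illustration. It records the empirical time-average of the per-slot reward $\sum_i R_i(\theta_i 1(s_i = 0) - s_i)$ and the observed inter-delivery times.

```python
import random
from statistics import mean, pvariance

def simulate(p, R, theta, K, policy, horizon, seed=0):
    """Simulate the slotted system; return (average reward per slot, gaps).

    gaps[i] lists the observed inter-delivery times of client i.
    All clients start in state 0 (an assumption of this sketch).
    """
    rng = random.Random(seed)
    N = len(p)
    s = [0] * N                      # slots elapsed since the latest delivery
    gaps = [[] for _ in range(N)]
    total = 0.0
    for _ in range(horizon):
        # Per-slot reward: sum_i R_i * (theta_i * 1(s_i = 0) - s_i)
        total += sum(R[i] * (theta[i] * (s[i] == 0) - s[i]) for i in range(N))
        served = set(policy(s, K))
        for i in range(N):
            if i in served and rng.random() < p[i]:
                gaps[i].append(s[i] + 1)   # slots between consecutive deliveries
                s[i] = 0
            else:
                s[i] += 1
    return total / horizon, gaps

def largest_state_first(s, K):
    """Hypothetical policy: serve the K clients with the largest state."""
    return sorted(range(len(s)), key=lambda i: (-s[i], i))[:K]

avg, gaps = simulate([0.8, 0.6], [1, 1], [3, 3], K=1,
                     policy=largest_state_first, horizon=5000)
print("average reward per slot:", avg)
for i, g in enumerate(gaps):
    print("client", i, "mean gap", mean(g), "variance", pvariance(g))
```

Any scheduling rule with the same `(s, K)` signature can be passed as `policy`, which makes it straightforward to compare candidate policies on the same random seed.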


V. Whittle Index 


We will pose the MDP of the previous section as a 
Restless Multiarmed Bandit Problem (RMBP). First we 
briefly describe the RMBP. A detailed discussion can be 
found in [9], [13], [22].

Consider a bandit which has N arms modeled as 
Markov processes. At each time a player can choose to 
play any K < N arms and collect a reward from each 
arm, where the reward is a function of the current state 
of the arm that is played. The time evolution of each arm 
depends on whether it was chosen to play or not; thus 
the bandits (arms) are “restless” and evolve even if they 
are not played. The player has to choose the K arms to 
play at each time, so as to maximize the expected reward. 

A “Whittle” policy, or “Index-based” policy, for the
RMBP, calibrates each of the N arms by deriving N
positive functions (called “index functions”) $W_i(\cdot)$, $i = 1, \ldots, N$,
which are defined for each possible value that
the state of arm i can assume. At time t the policy
simply chooses to play the K arms having the K largest
values of $W_i(s_i(t))$. After a re-labeling so that
$W_1(s_1(t)) \ge W_2(s_2(t)) \ge \cdots \ge W_N(s_N(t))$, the choices at time t are

$$u_i(t) = \begin{cases} 1 & \text{for } i = 1, 2, \ldots, K, \\ 0 & \text{otherwise.} \end{cases}$$



The derivation of the functions $W_i(\cdot)$ proceeds as follows.
Each arm is considered in isolation
from the rest of the arms, and the reward function is
now modified so that the player receives, in addition to
the original reward of the arm, a “subsidy” w each time
that he chooses not to play the arm (chooses “passive
action”), and the goal once again is to maximize the
average reward. After having solved this problem, let us
denote by $\Pi(w)$ the set of states in which an optimal policy
chooses to not play the arm (stay passive). Then the arm is
said to be indexable if for any two values of subsidies
$w_1, w_2$, we have $w_1 > w_2 \Rightarrow \Pi(w_2) \subseteq \Pi(w_1)$, and the
original MDP is said to be indexable if all the N arms are
indexable. In case the MDP is indexable, the Whittle Index
as a function of the state value n is defined as the smallest
value of subsidy that makes an optimal policy choose the
passive action when the client is in state n, i.e.,

$$W(n) = \inf\{w : n \in \Pi(w)\}. \qquad (5)$$

Thus, the Whittle index measures, in a sense, the “value” 
of an arm as a function of the present state, and the 
Whittle or Index policy chooses those K arms which have 
the highest value amongst the N arms. 
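The selection rule of the Index policy can be sketched in a few lines. This is an illustrative sketch only: the function name `index_policy`, the tie-breaking by arm label and the three example index functions are assumptions, not quantities from the paper.

```python
from typing import Callable, List, Sequence

def index_policy(states: Sequence[int],
                 index_fns: Sequence[Callable[[int], float]],
                 K: int) -> List[int]:
    """Return u with u[i] = 1 for the K arms having the largest W_i(s_i)."""
    values = [W(s) for W, s in zip(index_fns, states)]
    # Sort arm labels by decreasing index value; ties go to the lower label
    # (the tie-breaking rule is an assumption of this sketch).
    order = sorted(range(len(states)), key=lambda i: (-values[i], i))
    chosen = set(order[:K])
    return [1 if i in chosen else 0 for i in range(len(states))]

# Hypothetical index functions, for illustration only.
fns = [lambda s: 2.0 * s, lambda s: s + 3.0, lambda s: 0.5 * s * s]
print(index_policy([4, 1, 2], fns, K=2))
```

Because the rule only compares N scalar values per slot, its per-slot cost is dominated by the sort, independent of the size of the joint state space.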

VI. The Client Scheduling Problem is Indexable 

We will consider the $\beta$-discounted MDP, show that it
is indexable and derive the corresponding Whittle index.
The results for the average reward MDP will be obtained
by letting $\beta \to 1$. We begin with a brief description of the
single-arm $\beta$-discounted reward problem.

Consider the following single client $\beta$-discounted bandit
problem parametrized by w and $\beta$. The subscripts are
suppressed for convenience since the discussion below
applies to each of the N clients. Thus $s(t), p$ are used
in place of $s_i(t), p_i$.

There is a single client, whose state at time t, s(t), is the
time-elapsed-since-last-packet-delivery. At each time-slot,
we can choose from the following two control actions:
either attempt the transmission of a packet for it (active),
or stay idle (passive). The reward earned at time t is
$-Rs(t) + w + R\theta 1\{s(t) = 0\}$ if the client chooses
the passive action of not transmitting, while a reward of
$-Rs(t) + R\theta 1\{s(t) = 0\}$ is earned if the client chooses the
active action of transmitting. If the action at time t is
active, then s(t + 1), the state at time t + 1, becomes 0
with probability p, and s(t) + 1 with probability 1 - p. If
the action at time t is passive, then s(t + 1) = s(t) + 1.
The rewards are additive over time, with the reward at time t
discounted by a factor $\beta^t$. A policy specifies whether to be
active or remain passive at time t when the system state at
time t is s(t) = s.

We will prove that there is an optimal policy which is
of threshold type, i.e. there is a threshold “elapsed time
since last delivery” T (which depends on $\beta, w, p$), such
that the policy which keeps the client passive in slot t if
$s(t) < T$, and active if $s(t) \ge T$, is optimal.

By $c_i(T)$ we will denote the $\beta$-discounted reward
earned by a policy when the system starts with an initial
state value of i at time 0, and the policy with threshold
at T is used. Let $\tau_i$ be the first time that state i is hit,
i.e. $\tau_i = \min\{t \ge 1 : s(t) = i\}$. By “reward earned in the
cycle $i \to j \to 0 \to i$” we will mean the reward earned by
the system starting in state i in the time slots $0, \ldots, \tau_i - 1$,
while operating under the policy with threshold at j.


Expressions involving reward-functions belonging to a
single value of threshold are at times not mentioned as
a function of threshold. $X_p$ is a random variable that
is geometrically distributed with parameter p. Also, we
define $X := E\beta^{X_p}$ and $Y := E X_p \beta^{X_p}$.
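The two constants X and Y have simple closed forms that recur in the derivations. The sketch below assumes that $X_p$ is supported on {1, 2, ...} (the number of attempts up to and including the first success) and reads Y as $E[X_p \beta^{X_p}]$; both readings are assumptions of this example. Under them, $X = p\beta / (1 - (1-p)\beta)$ and $Y = p\beta / (1 - (1-p)\beta)^2$, which the code compares against truncated series.

```python
def X_closed(p, beta):
    """E[beta^X_p] for X_p geometric on {1, 2, ...} with success probability p."""
    return p * beta / (1 - (1 - p) * beta)

def Y_closed(p, beta):
    """E[X_p * beta^X_p] under the same assumption on X_p."""
    return p * beta / (1 - (1 - p) * beta) ** 2

def series(p, beta, terms=2000):
    """Truncated series for (E[beta^X_p], E[X_p beta^X_p])."""
    x = y = 0.0
    for k in range(1, terms + 1):
        pk = p * (1 - p) ** (k - 1)      # P(X_p = k)
        x += pk * beta ** k
        y += k * pk * beta ** k
    return x, y

x_num, y_num = series(0.5, 0.9)
print(x_num, X_closed(0.5, 0.9))
print(y_num, Y_closed(0.5, 0.9))
```

Note that $X = p\beta/(1 - \beta + p\beta)$ is the factor that appears repeatedly in the denominators of the discounted reward expressions.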

Lemma 1: Consider the single client $\beta$-discounted MDP.

1) $c_i(i + 1) - c_i(i)$ is a linear increasing function of the
subsidy w for all $i \ge 0$. It is strictly negative when
w = 0.

2) For each $n \ge 0$, there exists a unique value of the
subsidy, denoted W(n), such that $c_n(n + 1) = c_n(n)$.

3) $W(n) > W(n - 1)$; thus W(n) form an increasing
sequence.

4) For all values of thresholds T, if $j > i > T$, then

$c_i(T) > c_j(T)$.

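Proof: For T > 0, the infinite horizon discounted
reward earned starting in state i and following a policy
with threshold T + i is,

$$\begin{aligned} c_i(i+T) = {} & w \sum_{j=0}^{T-1} \beta^j - \sum_{j=0}^{T-1} R(i+j)\beta^j - R\beta^T E \sum_{j=0}^{X_p-1} (i+T+j)\beta^j \\ & + \beta^T \left(E\beta^{X_p}\right) \left( R\theta + \sum_{j=0}^{i-1} (w - Rj)\beta^j \right) + \beta^{T+i} \left(E\beta^{X_p}\right) c_i(i+T). \end{aligned}$$

Thus $c_i(i + T)$ depends on w as,

$$\frac{w \sum_{j=0}^{T-1} \beta^j + w\beta^T \left(E\beta^{X_p}\right) \sum_{j=0}^{i-1} \beta^j}{1 - \beta^{T+i}\left(E\beta^{X_p}\right)} = w \cdot \frac{\frac{1-\beta^T}{1-\beta} + \beta^T \frac{p\beta}{p\beta+1-\beta} \cdot \frac{1-\beta^i}{1-\beta}}{1 - \beta^{T+i} \frac{p\beta}{p\beta+1-\beta}} = w \cdot \frac{1-\beta+p\beta-\beta^T\left(1-\beta+p\beta^{i+1}\right)}{(1-\beta)\left(1-\beta+p\beta-p\beta^{T+i+1}\right)}. \qquad (6)$$

Thus $c_i(i + 1) - c_i(i)$ depends on w as,

$$\frac{(1-\beta)(1-\beta+p\beta)}{\left(1-\beta+p\beta-p\beta^{i+1}\right)\left(1-\beta+p\beta-p\beta^{i+2}\right)} \, w,$$

which is linear and increasing in w.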

Now we consider the case when w = 0. If $C_1$ is the
cost of cycle $i \to i \to 0 \to i$, then it follows via a simple
coupling argument that the cost of cycle $i \to i+1 \to 0 \to i$,
denoted $C_2$, is given by,

$$C_2 = -Ri + \beta C_1 - R\beta E \sum_{j=0}^{X_p-1} \beta^j,$$

and thus to prove the second result of the first statement,
we only have to show that

$$\frac{C_1}{1 - \beta^i X} > \frac{-Ri + \beta C_1 - R\beta E \sum_{j=0}^{X_p-1} \beta^j}{1 - \beta^{i+1} X}.$$

This is equivalent to showing that,

$$C_1 > -Ri \cdot \frac{1 - \beta^i X}{1 - \beta} - R\beta \left( E \sum_{j=0}^{X_p-1} \beta^j \right) \cdot \frac{1 - \beta^i X}{1 - \beta}.$$

We observe that $-Ri \cdot \frac{1 - \beta^i X}{1 - \beta} - R\beta \left( E \sum_{j=0}^{X_p-1} \beta^j \right) \cdot \frac{1 - \beta^i X}{1 - \beta}$
is the reward earned over the cycle $i \to i \to 0 \to i$ if
one were to modify the original cost function and instead
charge a penalty of $-Ri$ for value of states $s(t) < i$
and a penalty of $-Rs(t)$ if $s(t) > i$. However since the
original reward function is $-Rs(t) + R\theta 1\{s(t) = 0\}$
(note that w = 0), a simple coupling argument shows that
the reward earned is lower with the modified function.
This completes the proof of first statement.

Note that from the first statement it follows that $c_n(n + 1) - c_n(n)$
is a linear increasing function of w which is
less than 0 at w = 0. Hence there exists a value of w such
that the function $c_n(n + 1) - c_n(n)$ vanishes, and moreover
vanishes at a unique point since the slope of this function
is strictly positive. This value of w, where the function
$c_n(n + 1) - c_n(n)$ vanishes, is W(n).
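The existence argument above suggests a direct numerical recipe: since $c_n(n+1) - c_n(n)$ is affine in w, two evaluations locate its unique root. The sketch below is a numerical illustration under stated assumptions, not the closed form of the paper: the state space is truncated at a level `thr + extra`, the discounted reward of a threshold policy is computed by fixed-point iteration, and the names `discounted_values`, `gap` and `whittle_discounted` are hypothetical.

```python
def discounted_values(thr, w, R, theta, p, beta, extra=80, sweeps=400):
    """Approximate c_i(thr) for small i under the threshold policy
    (passive if s < thr, active otherwise).

    Assumptions of this sketch: states above thr + extra are cut off, and
    `sweeps` is large enough for beta**sweeps to be negligible.
    """
    M = thr + extra
    c = [0.0] * (M + 2)              # c[M + 1] stays 0 (truncation)
    for _ in range(sweeps):
        new = c[:]
        for s in range(M + 1):
            r = -R * s + (R * theta if s == 0 else 0.0)
            if s < thr:              # passive: earn subsidy, state grows by 1
                new[s] = r + w + beta * c[s + 1]
            else:                    # active: delivery with probability p
                new[s] = r + beta * (p * c[0] + (1 - p) * c[s + 1])
        c = new
    return c

def gap(n, w, **kw):
    """c_n(n + 1) - c_n(n) for subsidy w."""
    return discounted_values(n + 1, w, **kw)[n] - discounted_values(n, w, **kw)[n]

def whittle_discounted(n, **kw):
    """Root of the affine function w -> c_n(n + 1) - c_n(n)."""
    d0, d1 = gap(n, 0.0, **kw), gap(n, 1.0, **kw)
    return -d0 / (d1 - d0)

params = dict(R=1.0, theta=3.0, p=0.5, beta=0.9)
print([round(whittle_discounted(n, **params), 4) for n in range(4)])
```

For discount factors closer to 1, both `extra` and `sweeps` would need to be increased for the truncation and iteration errors to remain negligible.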

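Let $C_1, C_2$ be the costs of cycles $n \to n \to 0 \to n$ and
$n \to n+1 \to 0 \to n$. It is seen that,

$$c_n(n) = \frac{C_1}{1 - \beta^n X}, \qquad c_n(n+1) = \frac{C_2}{1 - \beta^{n+1} X}. \qquad (7)$$

Using a coupling argument we obtain,

$$C_2 = (W(n) - Rn) + \beta C_1 - R\beta E \sum_{j=0}^{X_p-1} \beta^j. \qquad (8)$$

Combining (7), (8) and the fact that for w = W(n) we
have $c_n(n) = c_n(n + 1)$,

$$\frac{C_1}{1 - \beta^n X} = \frac{(W(n) - Rn) + \beta C_1 - R\beta E \sum_{j=0}^{X_p-1} \beta^j}{1 - \beta^{n+1} X}, \quad \text{or}$$

$$C_1 (1 - \beta) = \left( W(n) - Rn - R\beta E \sum_{j=0}^{X_p-1} \beta^j \right) \left(1 - \beta^n X\right). \qquad (9)$$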

(9) 


Now let us check if under the value of subsidy set to W(n),
we have $c_{n-1}(n) > c_{n-1}(n - 1)$. If this is the case, then
from the first statement of this lemma, we will deduce
that $W(n - 1) < W(n)$. Now, $c_{n-1}(n) > c_{n-1}(n - 1)$ is
equivalent to showing

$$\frac{W(n) - R(n-1) + \beta C_1 - \beta^n X \left(W(n) - R(n-1)\right)}{1 - \beta^n X} > \frac{C_1 + R E \sum_{j=0}^{X_p-1} \beta^j - \beta^{n-1} X \left(W(n) - R(n-1)\right)}{1 - \beta^{n-1} X}.$$

After some algebraic manipulations and using (9) it can
be shown that proving the above inequality is equivalent
to proving X > 0, which indeed is true. This completes
the proof of third statement.

For the fourth statement, using a coupling argument,
we obtain $c_j(T) = c_i(T) - R(j - i) E \sum_{k=0}^{X_p-1} \beta^k$, and hence
$c_j(T) < c_i(T)$. ■


Lemma 2: Let the subsidy be w = W(n). Then for the
single client $\beta$-discounted MDP,

1) $c_i(n) = c_i(n + 1), \ \forall i \ge 0$.

2) $c_{i-1}(n) > c_i(n), \ \forall i \ge 1$.

Proof: Firstly recall that for subsidy = W(n), $c_n(n) - c_n(n + 1) = 0$.
Thus for $i = 0, 1, \ldots, n - 1$,

$$c_i(n) - c_i(n + 1) = \beta^{n-i} \left( c_n(n) - c_n(n + 1) \right) = 0. \qquad (10)$$

For $i \ge n + 1$,

$$c_i(n + 1) - c_i(n) = \beta X \left( c_0(n + 1) - c_0(n) \right) = 0,$$

where the last equality follows from (10). This proves the
first statement.

To prove the second result, consider the following cases:

i) For i > n, Lemma 1 implies that the inequality is
true.

ii) For 2 < i < n, denote d, as the cost incurred in the 
cycle n —»■ 0 —y i — 1. Then both a(n), and a(n + 1) 
can be derived in terms of cE When subsidy is equal 
to W(n), we have a(n) = cpn + 1), i.e., 

-0 n ~ i (l-P)d i = (11) 

W(n) (ft n X - P n ~^ + Rft-'n - p n Xi (12) 

+ (/3(3 - x ) ~ ft +lx + /3 n+1 X 2 ) . (13) 

where the first equality follows from statement 1. 
Similarly, a-i(n) — cpn) > 0 is equivalent to 

n—i—1 

]T (W(n) -Ri- Rj) ft + 0*-% > 

3=0 

n — i 

( W(n) -Ri + R- Rj) ft + P n ~ i+1 di 
j =0 

- ft 1 X ( W{n ) — Ri + R), 

i.e., 

-p n ~Pl - ftdi+R 1 ~^ 1 + (W(n) -nR + R) p n ~ l 
- p n X C W{n) - Ri + R) > 0 or 

(i-rxpp-r+'x) 

-^ U ’ 

where the second-last equivalence follows from (11).
We note that the last inequality holds trivially for all
$\beta \in (0, 1)$ and hence the statement 2 holds for
$i = 2, \ldots, n$.

iii) i = 1. We compare the cost incurred by the system
starting in state 0 over the cycle $0 \to n \to 0$ (say $C_0$)
with the cost incurred over the cycle $j \to n \to 0 \to j$
when starting in state j (denoted $C_j$) via coupling
the processes associated with the two systems
constructed on the same probability space. Clearly
$C_0 > C_j$. Thus $c_0(T) > c_j(T)$ for any value of
threshold T.

Lemma 3: The function $w + p\beta \left( c_i(T) - c_0(T) \right)$ (which
depends on w, i, T) is linear, increasing in w. Also,

$$W(n) + p\beta \left( c_{n+1}(n) - c_0(n) \right) = 0 \quad \text{for } n = 0, 1, \ldots. \qquad (14)$$


Proof: We consider the following cases:
i) For $i \le T$, it follows from (6) that the function
$w + p\beta \left( c_i(T) - c_0(T) \right)$ depends on w as

$$\frac{1 - \beta - p\beta + p\beta^{T-i+1}}{1 - \beta + p\beta - p\beta^{T+1}} \, w. \qquad (15)$$

We have $1 - \beta + p\beta - p\beta^{T+1} > 0, \ \forall \beta < 1$. Also,
$1 - \beta - p\beta + p\beta^{T-i+1} \ge 1 - 2\beta + \beta^{T-i+1} > 0$ since the
function

$$1 - 2\beta + \beta^k > 0, \ \forall k \ge 1, \ \beta \in (0, 1).$$

Thus, in the expression (15) the coefficient of w is
positive.

ii) For $i \ge T + 1$, we have,

$$c_i(T) = E \sum_{j=0}^{X_p-1} R(-i-j)\beta^j + X c_0(T).$$

The dependence of $c_0(T)$ on w can be obtained
from (6). Combining, $w + p\beta \left( c_i(T) - c_0(T) \right)$ depends
on w as,

$$\frac{1 - \beta}{1 - \beta + p\beta - p\beta^{T+1}} \, w,$$

which has a positive slope with respect to w.

This completes the proof of first statement. Note that for
w = W(n), we have

$$c_n(n + 1) = c_n(n).$$

This implies

$$-Rn + W(n) + \beta c_{n+1}(n + 1) = -Rn + \beta \left( p c_0(n) + (1 - p) c_{n+1}(n) \right), \quad \text{i.e.}$$

$$W(n) + \beta c_{n+1} = \beta \left( p c_0 + (1 - p) c_{n+1} \right), \quad \text{and so}$$

$$W(n) + p\beta \left( c_{n+1} - c_0 \right) = 0.$$

Above, in the second implication, we have used the first
statement of Lemma 2 to remove the dependence of $c_i(\cdot)$
on the threshold values. ■

Theorem 1: For the $\beta$-discounted MDP with subsidy
$w \in [W(n), W(n + 1))$, the policy with threshold at n
is optimal. Thus the MDP is indexable and W(n) is the
Whittle index when the state is n.

Proof: Fix a $w \in [W(n), W(n + 1))$. If the policy is indeed
optimal, then the Dynamic Programming optimality
equation would be satisfied. Hence we only need to verify
the inequality

$$-Ri + w + \beta c_{i+1} \ge -Ri + \beta \left[ (1 - p) c_{i+1} + p c_0 \right], \quad \text{for } i = 0, 1, \ldots, n,$$

or, equivalently,

$$w + \beta p \left( c_{i+1} - c_0 \right) \ge 0, \qquad (16)$$

with strict inequality holding if $w \in (W(n), W(n + 1))$,
and equality holding for i = n, w = W(n). Similarly for
$i = n + 1, n + 2, \ldots$ we have to verify the inequality

$$w + \beta p \left( c_{i+1} - c_0 \right) < 0. \qquad (17)$$

We will first prove (16). We use superscripts to distinguish
between costs $c_i$ calculated under different values of
subsidy. We have,

$$\begin{aligned} w + \beta p \left( c^{w}_{i+1} - c^{w}_{0} \right) &\ge W(n) + \beta p \left( c^{W(n)}_{i+1} - c^{W(n)}_{0} \right) \\ &= \beta p \left( c^{W(n)}_{0} - c^{W(n)}_{n+1} \right) + \beta p \left( c^{W(n)}_{i+1} - c^{W(n)}_{0} \right) \\ &\ge 0, \end{aligned}$$

where the first inequality and equality follow from
Lemma 3 and the last inequality follows from Lemma 2.

To prove (17) we have,

$$\begin{aligned} w + \beta p \left( c^{w}_{i+1} - c^{w}_{0} \right) &< W(n+1) + \beta p \left( c^{W(n+1)}_{i+1} - c^{W(n+1)}_{0} \right) \\ &= \beta p \left( c^{W(n+1)}_{0} - c^{W(n+1)}_{n+2} \right) + \beta p \left( c^{W(n+1)}_{i+1} - c^{W(n+1)}_{0} \right) \\ &= \beta p \left( c^{W(n+1)}_{i+1} - c^{W(n+1)}_{n+2} \right) \\ &\le 0, \end{aligned}$$

where first two steps follow from Lemma 3 and the
last inequality follows from Lemma 2. This completes
the proof of optimality of the policy with threshold at n.
Following (5), the Whittle index for the state n is thus given
by

$$\inf\{w : n \in \Pi(w)\} = \inf\{w : w \ge W(n)\} = W(n),$$

where the first equality follows from the first statement
of the Theorem. ■

We now proceed to explicitly derive the values of the 
indices W(n). 

Theorem 2: 


W(n) 


fi 


h 


h 

U 

fs 


PWl-/2-/, + /■) . where i 

/ 5 

• ((1 - X) K1 -0)+0\- Y( 1 - f})), 

/3(1 — f3 n ) — /3™n(l — f3) M v , 

-(- { 

l T ^ii-P n x), 

e (i - x ), 

1-/3 n X -pl3(j^j) (1 — X) 

1-/3 

1-/3 + p/3' 


Proof: From (1141) we have, 


W{n) = p/3{c 0 - Cn + 1 ) 
X„


= p/3{c 0 c n - Ey^/3 J ). 

3=0 


Now, 


co c n — 


C 0 -C n 
1 — B n 'KP x p ’ 


08) 


09) 


where Co, C n are the costs over the cycles 0 —>• n —> 0 and 
n —> n —> 0 —>• n. We can compute Co — C n as, 


x„-\ 


C 0 -C„= E ^ (n + j)F (1-/3") 


( 20 ) 


3=o 


+ ( '52{W(n)-j)F (1 — E/3 Xp ) + 6 (\ — (3 Xp ). 

C=o / 

( 21 ) 


Combining (I18I19I20D and setting A = E E ? E 0 (n+i)/3 J , 
we have, 


W(n) = p/3 


A(1 - D + (E™=o (W(n) - j) p j ) (1 - E/3 Xp ) 
1 - /3 n E/3 x p 


E vV i 

^ + 1 _ J > 


or. 


W(n) 


(e;=o^) o-e^)" 

1 - d/9 . -1 —:-'- 

PP 1 - /3"E/3 A V 

a(i - n + (e;=o -i/i 3 ) (i - Ed Xp ) 


-E y p j + 


3=o 


1 - /3 n E/3 x p 

e (i -e/ 3 x v)\ 

1 - p n Ep x p J ’ 


1 -, 


1 - p n x 

Xp-l 

/3" (n + j)P 3 + 


w lzjr + P0_ -/?") - n /3 


n +1 


(1-/3) 


i ?0 


1 - P n X 


(1-/3 ) 2 

and so 


j=i 

wi-^fn) = 

oti 

/ 1 — P n p(l — P n ) — np n+1 (l — p) 

In \ w 1 - P n X (1 - P){ 1 - /3"X) 

- ( l-/3)/3" £ (- + i)/3 3 + ^=#j 


np 

= w -- 

np + 1 

< oo. 


+ Rp(n + rc) + 


Rpd 


2 (np +1) np + 1 


(24) 


which simplifies to,

$$W(n) \cdot f_5 = p\beta \left( f_1 - f_2 - f_3 + f_4 \right). \qquad (22)$$

■

Theorem 3: The Whittle indices for the average cost 
MDP are given by, 

VT Av S(n) = lim W^(n) = nRp • (— + -—— + + Rpd. 

/3-n V 2 1+P 2 / 

(23) 

Proof: The expression (23) is easily derived
from (22). It remains to show that the quantities
$W^{Avg}(n)$ are indeed Whittle indices for the average-cost
problem. Fix the subsidy to be w, and without loss of
generality let $w \in \left( W^{Avg}(n), W^{Avg}(n + 1) \right)$. Below we
use superscripts to exhibit the dependence of the cost on
$\beta$. Now,

Coin) = 


Since for each m, $W^{\beta}(m) \to W^{Avg}(m)$, it follows
from Theorem 2 that there exists a $\beta^*(w)$ such that the
policy with the threshold at n is optimal for the single
client $\beta$-discounted MDP for all $\beta \in (\beta^*(w), 1)$. However
since $\lim_{\beta \to 1} (1 - \beta) c_0^{\beta}(n)$ exists, the policy with threshold
at n is also optimal for the average cost problem.
Since w can assume any value in the interval
$\left( W^{Avg}(n), W^{Avg}(n + 1) \right)$, the policy with threshold at
n is optimal for the average cost MDP for each value of
subsidy $w \in \left( W^{Avg}(n), W^{Avg}(n + 1) \right)$. Thus,

$$\inf\{w : \text{optimal policy chooses active at } n\} \le W^{Avg}(n). \qquad (25)$$

Similarly, picking subsidy $w < W^{Avg}(n)$ shows that the
active action is not optimal for any value of subsidy
$w < W^{Avg}(n)$. Hence,

$$\inf\{w : \text{optimal policy chooses active at } n\} = W^{Avg}(n), \qquad (26)$$

and we obtain that $W^{Avg}(n)$ are indeed the Whittle
indices for the average cost problem. ■

We note that the expression (24) is the average reward
earned under the subsidy W and threshold at n. We will
denote this quantity as $C^{Avg}(W, n)$.
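The average reward of a threshold policy can also be obtained directly from a renewal-reward argument, which gives an independent way to locate the subsidy at which thresholds n and n + 1 perform equally. The sketch below is an illustration under stated assumptions, not the closed form (23): it takes the state evolution of Section IV literally, so one cycle visits the states 0, 1, ..., n + X_p - 1 with $X_p$ geometric on {1, 2, ...}; the function names are hypothetical, and values computed this way may differ from the paper's expressions by an indexing convention.

```python
from fractions import Fraction

def avg_reward(n, w, R, theta, p):
    """Average reward per slot under threshold n with passive subsidy w.

    Renewal-reward sketch: a cycle lasts n + X_p slots, the subsidy is earned
    in the n passive slots, R*theta is earned once (in state 0), and the
    holding cost is R times the sum of the visited states.
    """
    p = Fraction(p)
    ex, ex2 = 1 / p, (2 - p) / p**2          # E[X_p], E[X_p^2]
    cycle = n + ex
    # E[sum_{s=0}^{n+X_p-1} s] = E[(n+X_p)(n+X_p-1)] / 2
    state_sum = (n * n + 2 * n * ex + ex2 - n - ex) / 2
    return (w * n + R * theta - R * state_sum) / cycle

def indifference_subsidy(n, R, theta, p):
    """Subsidy at which thresholds n and n + 1 earn the same average reward."""
    # The difference is affine in w, so two evaluations determine its root.
    d0 = avg_reward(n + 1, 0, R, theta, p) - avg_reward(n, 0, R, theta, p)
    d1 = avg_reward(n + 1, 1, R, theta, p) - avg_reward(n, 1, R, theta, p)
    return -d0 / (d1 - d0)

for n in range(4):
    print(n, indifference_subsidy(n, R=1, theta=3, p=Fraction(1, 2)))
```

Exact rational arithmetic is used so that the indifference condition can be checked without rounding error.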

VII. Bounds on Optimal Reward. 

Lemma 4: For the average cost MDP, the reward obtained
under any policy is upper-bounded by the value of
the following optimization problem:


N 


max 


J2 R 

2=1 

such that y 




2=1 


DiPi 


<l,Di>0,i = l,...,N. (27) 


Proof: The random reward earned in time steps
1, 2, ..., t is given by,

Ni(t) 

- y a(o 2 + 0 ^) 


N 


C{t) :=y 


Ri 


2=1 


1=1 
where $N_i(t)$ is the number of packets of client i delivered
by time t and $D_i(l)$ is the interdelivery time of the l-th packet
of client i. Let us assume that the average interdelivery
time for client i under a policy is equal to $\bar{D}_i$. Thus,


liminfEC(t) < limsupEC'(t) 

t—too 

< ElimsupC(f) 


t—too 


N 


= E lim sup Ri 


N 




i= 1 


i= 1 
S 2 


eS 4 ) a(o 2 


9iNi(t) 




where the second inequality follows from Fatou’s lemma
and the last is Jensen’s inequality. Thus solving the
optimization problem (27) gives an upper bound on the
reward of any policy. We note that the constraint
$\sum_{i=1}^{N} \frac{1}{D_i p_i} \le 1, \ D_i > 0$ is simply the capacity of the wireless
channel.


Next we consider the Lagrangian relaxation of
the RMBP [9]. For this, we relax the constraint
of choosing K arms at each time, to the
constraint that one plays K arms on average, i.e.,

$$\lim_{t \to \infty} \frac{\text{Total number of arms played by time } t}{t} = K.$$

Clearly the maximum possible reward in the relaxed
problem is greater than or equal to the reward earned
by any policy for the original RMBP. Also since the Index
policy is the optimal solution to this relaxed problem
([9]), its value function serves as an upper-bound for the
value function of the RMBP.

Lemma 5: Let $C_i^{Avg,*}(W)$ be the average reward earned by
the policy maximizing the single-client average reward
under the subsidy W (24). Then the reward for the
average cost MDP obtained by any policy is less than or
equal to,


N 


inf ^ C^’^W) - W(N - K) 


W> 0 


N 


= inf V PE 
w^>n l * 


riiPi RiPi(n? + ni) 


w> o 
RiPiO. 


riiPi + 1 


riiPi + 1 2 (mpi + 1 ) 

— W(N — K)) , 


= inf 
w>o 


N 


ME 


nm 


n iPi + 1 


+ K — N \ + 


RiPiinj + rij) 

2 (riiPi + 1 ) 


RiPi^i 


riiPi + 1 _ 

where $n_i$ is such that $W \in \left( W_i(n_i), W_i(n_i + 1) \right)$.
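The relaxed bound $\inf_{W \ge 0} \sum_i C_i^{Avg,*}(W) - W(N - K)$ can be approximated numerically by a search over W. The following sketch is illustrative only: it evaluates the single-client average reward by a renewal-reward argument that takes the state evolution of Section IV literally (states 0, 1, ..., n + X_p - 1 in a cycle), restricts thresholds to `n_max` and W to a finite grid, and uses hypothetical function names. Any W in the grid yields a valid bound for this model provided the maximizing threshold does not exceed `n_max`; the grid minimum is therefore an upper estimate of the infimum.

```python
from fractions import Fraction

def best_single_client(W, R, theta, p, n_max=50):
    """Max over thresholds n <= n_max of the average reward with subsidy W."""
    p = Fraction(p)
    best = None
    for n in range(n_max + 1):
        cycle = n + 1 / p                                   # E[n + X_p]
        state_sum = Fraction(n * n - n, 2) + n / p + 1 / p**2 - 1 / p
        g = (W * n + R * theta - R * state_sum) / cycle
        if best is None or g > best:
            best = g
    return best

def relaxed_bound(clients, K, W_grid):
    """min over the grid of sum_i C_i(W) - W * (N - K).

    clients is a list of (R, theta, p) triples (a layout chosen for this sketch).
    """
    N = len(clients)
    return min(sum(best_single_client(W, *c) for c in clients) - W * (N - K)
               for W in W_grid)

grid = [Fraction(k, 10) for k in range(0, 201)]
clients = [(1, 3, Fraction(4, 5)), (1, 3, Fraction(3, 5))]
print(float(relaxed_bound(clients, K=1, W_grid=grid)))
```

Because the objective is a maximum of affine functions of W minus a linear term, it is convex in W, so a finer grid or a bisection on the slope could replace the uniform grid if a tighter estimate is needed.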


VIII. Optimality of Index Policy 
Now we consider several special cases of interest. 


Theorem 4: Consider the average cost problem for the
case where all the clients are identical, i.e., $R_i = 1$ and
$p_i = p$ for all the clients. The index policy is optimal in
this case.

Proof: Firstly we note that in this symmetric case,
the Index policy serves the client with the largest value of
the state, i.e. the policy is, “largest time-since-last-service-first”.
We will prove the result only for the case of two
clients, each having channel reliability p. The case where
there are multiple such clients follows in a straightforward
manner.

Consider the time-horizon at t. If $(s_1, s_2)$ is the initial
value of the state vector, and $R_t(s)$ is the maximum
reward that can be earned when there are t time-slots to
go, then the Dynamic Programming optimality equation
becomes,

$$R_t[(s_1, s_2)] = -(s_1 + s_2) + (1 - p) R_{t-1}[(s_1 + 1, s_2 + 1)] + p \max\{ R_{t-1}[(0, s_2 + 1)], \ R_{t-1}[(s_1 + 1, 0)] \},$$

where the optimal action corresponds to the one maximizing
the expression on the right hand side. Let us
assume without loss of generality that $s_1 < s_2$. Then
$R_{t-1}[(0, s_2 + 1)] < R_{t-1}[(s_1 + 1, 0)]$, which implies that
the optimal action is to serve client 2. ■
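The key inequality of this proof can be checked numerically on small instances. The sketch below evaluates the Dynamic Programming recursion above for two identical clients by memoized recursion and compares the two candidate continuation values. The terminal condition $R_0 \equiv 0$ and the function names are assumptions of this example, and the check uses a non-strict comparison with a small tolerance.

```python
from functools import lru_cache

def make_value(p):
    """Return value(t, s1, s2): maximum reward with t slots to go."""
    @lru_cache(maxsize=None)
    def value(t, s1, s2):
        if t == 0:
            return 0.0                               # assumed terminal reward
        fail = value(t - 1, s1 + 1, s2 + 1)          # attempted delivery fails
        serve1 = value(t - 1, 0, s2 + 1)             # client 1 is delivered
        serve2 = value(t - 1, s1 + 1, 0)             # client 2 is delivered
        return -(s1 + s2) + (1 - p) * fail + p * max(serve1, serve2)
    return value

def serving_larger_state_is_best(p, horizon=6, s_max=5, tol=1e-12):
    """Check R_t[(0, s2+1)] <= R_t[(s1+1, 0)] whenever s1 < s2."""
    value = make_value(p)
    return all(value(t, 0, s2 + 1) <= value(t, s1 + 1, 0) + tol
               for t in range(horizon + 1)
               for s2 in range(s_max + 1)
               for s1 in range(s2))

print(serving_larger_state_is_best(0.8))
```

The check is consistent with the “largest time-since-last-service-first” rule: resetting the larger of the two states never leaves the system with a worse continuation value on the instances tested.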

IX. Simulations 

We have carried out simulations to compare the performance
of the optimal policy, which was obtained via
the Policy Iteration tool-box in Matlab, with that of the Index
policy obtained in Theorem 3. We present three plots
in Figures 1-3. In all the cases considered, 2 clients share
a single channel. To obtain Figure 1 we fix client 1’s
parameters as $p_1 = 0.8, \theta_1 = 3, R_1 = 1$, while for client 2 we
fix $\theta_2 = 3, R_2 = 1$ and vary $p_2$ from 0 to 1. For Figure 2
we fix client 1’s parameters to be $p_1 = 0.8, \theta_1 = 3, R_1 = 1$,
while for client 2 we fix $p_2 = 0.6, R_2 = 1$ and vary the
value of $\theta_2$ from 1 to 10. To obtain Figure 3 we fix client
1’s parameters as $p_1 = 0.8, \theta_1 = 5, R_1 = 5$, and for client
2 we fix the parameters $p_2 = 0.6, \theta_2 = 5$ while varying the
value of $R_2$.

We observe that the Index policy gives near-optimal performance
in all the cases.

X. Concluding Remarks 

We have proposed an analytical framework for exploring
the full range of mean vs. variance tradeoffs
in inter-delivery times in wireless sensor networks, i.e.
Throughput vs. Service Regularity trade-off. The problem
can be formulated as a Restless Multiarmed Bandit Problem
and indices can be obtained in closed form. Simulations
indicate near-optimal performance of the resulting Index
policy.
Fig. 1: Reward Optimal Policy vs. Index Policy for $p_1 = 0.8, \theta_1 = 3, R_1 = 1, \theta_2 = 3, R_2 = 1$, $p_2$ varying from .1 to 1.

Fig. 2: Reward Optimal Policy vs. Index Policy for $p_1 = 0.8, \theta_1 = 3, R_1 = 1, p_2 = 0.6, R_2 = 1$ while $\theta_2$ varies from 1 to 10.

Fig. 3: Reward Optimal Policy vs. Index Policy for $p_1 = 0.8, \theta_1 = 5, R_1 = 5, p_2 = 0.6, \theta_2 = 5$ while $R_2$ is varied.